Add prepare command #38
Note to self:
from fast_llm.data.gpt.memmap import GPTMemmapDataset
import pytest

def dtype_arrays(dtype: np.dtype, min_size: int=1, max_size: int=100) -> st.SearchStrategy:
I'm not following what the hypothesis module brings here. You seem to be just creating a list of random arrays, is that right? This can easily be done in plain numpy with the same function complexity.
The benefit is that hypothesis will try to shrink the inputs to a minimal reproducing example when a test fails.
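To illustrate the shrinking behaviour mentioned in this reply, here is a minimal sketch. The body of `dtype_arrays` is not shown in this thread, so the implementation below is an assumption built on `hypothesis.extra.numpy.arrays`. The helpers `has_large_value` and `find_minimal_failure`, and the "value of at least 1000" condition, are hypothetical stand-ins for a real failing test.

```python
import numpy as np
from hypothesis import find, settings, strategies as st
from hypothesis.extra import numpy as npst


def dtype_arrays(dtype: np.dtype, min_size: int = 1, max_size: int = 100) -> st.SearchStrategy:
    # Assumed body (not taken from the PR): a list of 1-D arrays of the given dtype.
    return st.lists(
        npst.arrays(dtype=dtype, shape=st.integers(min_value=1, max_value=20)),
        min_size=min_size,
        max_size=max_size,
    )


def has_large_value(arrays: list) -> bool:
    # Hypothetical "bug trigger": some element is at least 1000.
    return any(bool((a >= 1000).any()) for a in arrays)


def find_minimal_failure() -> list:
    # find() searches for an input that satisfies the predicate,
    # then shrinks it towards a simpler input that still satisfies it.
    return find(
        dtype_arrays(np.dtype(np.int32), max_size=10),
        has_large_value,
        settings=settings(max_examples=500, database=None),
    )


if __name__ == "__main__":
    print(find_minimal_failure())
```

The returned list is usually much smaller than the random input that first satisfied the predicate, although the exact value is not guaranteed. A plain numpy generator would report only the original large input, which is the difference the reply points to.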
jlamypoirier left a comment
LGTM, assuming my proposed modifications are ok
✨ Description
Extracted and refined the dataset preparation script from #17.
Made it a command like `train` or `convert`.
Example call and config:
or
where `foo.yaml` contains:
Run `git clone https://huggingface.co/HuggingFaceTB/SmolLM-135M` in `tmp` to get that tokenizer file.
This will produce:
with `fast_llm_dataset.json` reading:
{
    "datasets": [
        {
            "prefix": "shard_0_0",
            "num_documents": 10000,
            "num_tokens": 11569536,
            "weight": 1.0
        }
    ]
}
The `downloaded_dataset` can be deleted afterwards. It is not used by Fast-LLM.
🔍 Type of change
Select all that apply:
📝 Changes
`prepare_dataset` command
`Dockerfile`
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General:
Dependencies and Configuration:
Testing:
Performance Impact:
📊 Performance Impact Details
N/A
📝 Additional Notes
N/A